Avoid concurrent snapshot finalizations when deleting an INIT snapshot #28078
Conversation
@imotov I'd be happy to have your opinion on this. The reported scenarios are very hard to reproduce, but I think they can happen, especially on single-node clusters using S3. A second look at this would be very helpful.
Nice catch! I think it would make sense to bake it on master for a bit before backporting, to make sure this is the only issue in this test.
LGTM, I've added a few small suggestions.
        // snapshot is still initializing, mark it as aborted
        shards = snapshotEntry.shards();
    } else if (snapshotEntry.state() == State.STARTED) {
`state == State.STARTED`? Otherwise there's no need to define the local variable `state` above.
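A minimal sketch of what the suggested form might look like, reusing the `state` local from the quoted diff (hypothetical context, not the exact SnapshotsService code):

```java
// Hypothetical shape after applying the suggestion: the value read once from
// snapshotEntry.state() is reused in both branches instead of calling state() again.
final State state = snapshotEntry.state();
if (state == State.INIT) {
    // snapshot is still initializing, mark it as aborted
    shards = snapshotEntry.shards();
} else if (state == State.STARTED) {
    // ...
}
```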
            accepted = true;
        }
    } else {
        entries.add(entry);
        failure = "snapshot state changed to " + entry.state() + " during initialization";
Add `assert entry.state() == State.ABORTED` here. You can directly write the message as "snapshot was aborted during initialization", which makes it clearer which situation is handled here.
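For illustration, applying both parts of the suggestion to the quoted branch could look roughly like this (a fragment sketched against the lines shown above, not the final diff):

```java
} else {
    // Only an aborted snapshot should reach this branch during initialization.
    assert entry.state() == State.ABORTED : "unexpected state " + entry.state();
    entries.add(entry);
    failure = "snapshot was aborted during initialization";
```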
            accepted = true;
        }
    } else {
        entries.add(entry);
        failure = "snapshot state changed to " + entry.state() + " during initialization";
        updatedSnapshot = entry;
It was not really updated. That's just set here so that `endSnapshot()` is called below. Maybe instead of the `updatedSnapshot` and `accepted` variables we should have an `endSnapshot` variable that captures the snapshot to end.
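A rough sketch of the refactoring being suggested here; the `endSnapshot` variable name comes from the comment, while the surrounding lines and the method signature are hypothetical:

```java
// Sketch only: a single nullable reference replaces the updatedSnapshot/accepted pair.
SnapshotsInProgress.Entry endSnapshot = null;
// ...
} else {
    assert entry.state() == State.ABORTED : "unexpected state " + entry.state();
    entries.add(entry);
    failure = "snapshot was aborted during initialization";
    endSnapshot = entry;   // remember which snapshot to end once the update is applied
}
// ...
if (endSnapshot != null) {
    endSnapshot(endSnapshot, failure);   // hypothetical call; replaces the accepted flag
}
```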
* master:
  Use Gradle wrapper when building BWC
  Painless: Add a simple cache for whitelist methods and fields. (elastic#28142)
  Fix upgrading indices which use a custom similarity plugin. (elastic#26985)
  Fix Licenses values for CDDL and Custom URL (elastic#27999)
  Cleanup TcpChannelFactory and remove classes (elastic#28102)
  Fix expected plugins test for transport-nio
  [Docs] Fix Date Math example descriptions (elastic#28125)
  Fail rollover if duplicated alias found in template (elastic#28110)
  Avoid concurrent snapshot finalizations when deleting an INIT snapshot (elastic#28078)
  Deprecate `isShardsAcked()` in favour of `isShardsAcknowledged()` (elastic#27819)
  [TEST] Wait for replicas to be allocated before shrinking
  Use the underlying connection version for CCS connections (elastic#28093)
  test: do not use asn fields
  Test: Add assumeFalse for test that cannot pass on windows
  Clarify reproduce info on Windows
  Remove out-of-date projectile file
* master: (27 commits)
  Declare empty package dirs as output dirs
  Consistent updates of IndexShardSnapshotStatus (elastic#28130)
  Fix Gradle wrapper usage on Windows when building BWC (elastic#28146)
  [Docs] Fix some typos in comments (elastic#28098)
  Use Gradle wrapper when building BWC
  Painless: Add a simple cache for whitelist methods and fields. (elastic#28142)
  Fix upgrading indices which use a custom similarity plugin. (elastic#26985)
  Fix Licenses values for CDDL and Custom URL (elastic#27999)
  Cleanup TcpChannelFactory and remove classes (elastic#28102)
  Fix expected plugins test for transport-nio
  [Docs] Fix Date Math example descriptions (elastic#28125)
  Fail rollover if duplicated alias found in template (elastic#28110)
  Avoid concurrent snapshot finalizations when deleting an INIT snapshot (elastic#28078)
  Deprecate `isShardsAcked()` in favour of `isShardsAcknowledged()` (elastic#27819)
  [TEST] Wait for replicas to be allocated before shrinking
  Use the underlying connection version for CCS connections (elastic#28093)
  test: do not use asn fields
  Test: Add assumeFalse for test that cannot pass on windows
  Clarify reproduce info on Windows
  Remove out-of-date projectile file
  ...
#28078) This commit removes the finalization of a snapshot by the snapshot deletion request. This way, the deletion marks the snapshot as ABORTED in the cluster state and waits for the snapshot completion. It is the responsibility of the snapshot execution to detect the abortion and terminate itself correctly. This avoids concurrent snapshot finalizations and also orders the operations: the deletion aborts the snapshot and waits for the snapshot completion, the creation detects the abortion, stops by itself and finalizes the snapshot, then the deletion resumes and continues the deletion process.
I didn't see any CI failure related to this change, so I backported it to 6.2 and 6.1.3. @imotov Do you think it should be backported to 6.0.2 and 5.6 too?
Personally, I would love to see this in 5.6. I think this bug really affects the user experience when it comes to snapshots on unstable clusters. We are unlikely to ever release 6.0.2, but if you backport it to 5.6, it probably makes sense to backport it to the 6.0 branch just in case.
Sorry, I mixed up labels. This was merged in 5.6.7 and 6.1.3.
With the current snapshot/restore logic, a newly created snapshot is added by the `SnapshotService.createSnapshot()` method as a `SnapshotInProgress` object in the cluster state. This snapshot has the INIT state. Once the cluster state update is processed, the `beginSnapshot()` method is executed using the `SNAPSHOT` thread pool.
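To make this two-phase start easier to follow, here is a minimal, self-contained sketch of the shape described above; the class, field and method names below are simplified stand-ins, not the actual SnapshotsService code:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Simplified model of the two-phase start: createSnapshot() registers an
// in-progress entry in INIT state, and beginSnapshot() later runs the long
// initialization on a dedicated "SNAPSHOT" executor.
public class SnapshotStartSketch {
    enum State { INIT, STARTED, ABORTED }

    // Stand-in for the snapshots-in-progress entries kept in the cluster state.
    static final ConcurrentHashMap<String, State> snapshotsInProgress = new ConcurrentHashMap<>();
    static final ExecutorService snapshotThreadPool = Executors.newFixedThreadPool(1);

    static void createSnapshot(String snapshot) {
        // Phase 1: the cluster-state update adds the entry in INIT state.
        snapshotsInProgress.put(snapshot, State.INIT);
        // Phase 2: once the update is applied, initialization runs asynchronously.
        snapshotThreadPool.submit(() -> beginSnapshot(snapshot));
    }

    static void beginSnapshot(String snapshot) {
        // initializeSnapshot(): reads repository data, then writes the global and
        // per-index metadata files; this can take minutes on a slow repository.
        System.out.println("initializing " + snapshot + " in state " + snapshotsInProgress.get(snapshot));
    }

    public static void main(String[] args) {
        createSnapshot("snap-1");
        snapshotThreadPool.shutdown();
    }
}
```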
The `beginSnapshot()` method starts the initialization of the snapshot using the `initializeSnapshot()` method. This method reads the repository data and then writes the global metadata file and an index metadata file per index to be snapshotted. These operations can take some time to be completed (it could be many minutes).
At this stage, if the master node is disconnected, the snapshot can be stuck in the INIT state on versions 5.6.4/6.0.0 or lower (pull request #27214 fixed this on 5.6.5/6.0.1 and higher).
If the snapshot is not stuck but the initialization takes some time and the user decides to abort the snapshot, a delete snapshot request can sneak in. The deletion updates the cluster state to check the state of the `SnapshotInProgress`. When the snapshot is in INIT, it executes the `endSnapshot()` method (which returns immediately) and then the snapshot's state is updated to `ABORTED` in the cluster state. The deletion will then listen for the snapshot completion in order to continue with the deletion of the snapshot.
But before returning, the `endSnapshot()` method added a new `Runnable` to the `SNAPSHOT` thread pool that forces the finalization of the initializing snapshot. This finalization writes the snapshot metadata file and updates the index-N file in the repository.
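The problematic shape can be reduced to two uncoordinated tasks finalizing the same snapshot on the same pool. A toy sketch, assuming nothing about the real repository code beyond what is described above:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Toy reproduction of the race: two independent code paths (snapshot deletion
// and snapshot creation) each submit a task that finalizes the same snapshot,
// with nothing ordering one after the other.
public class ConcurrentFinalizationSketch {
    static final ExecutorService snapshotThreadPool = Executors.newFixedThreadPool(2);

    static void finalizeSnapshot(String snapshot, String caller) {
        // In the real code this writes the snapshot metadata file and updates
        // the repository's index-N file; doing it twice concurrently is the bug.
        System.out.println(caller + " finalizing " + snapshot);
    }

    public static void main(String[] args) throws InterruptedException {
        // The deletion's endSnapshot() queues a forced finalization...
        snapshotThreadPool.submit(() -> finalizeSnapshot("snap-1", "deletion"));
        // ...while the creation path, once all shards fail as ABORTED, queues its own.
        snapshotThreadPool.submit(() -> finalizeSnapshot("snap-1", "creation"));
        snapshotThreadPool.shutdown();
        snapshotThreadPool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

Whether the deletion's task or the creation's task touches the repository first is entirely up to the pool's scheduling, which is what produces the different orderings described next.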
At this stage two things can potentially be executed concurrently: the initialization of the snapshot and the finalization of the snapshot. When `initializeSnapshot()` terminates, the cluster state is updated to start the snapshot and to move it to the `STARTED` state (this is before #27931, which prevents an `ABORTED` snapshot from being started at all). The snapshot is started and shards start to be snapshotted, but they quickly fail because the snapshot was `ABORTED` by the deletion. All shards are reported as `FAILED` to the master node, which executes `endSnapshot()` too (using `SnapshotStateExecutor`).
Then many things can happen, depending on the execution of tasks by the `SNAPSHOT` thread pool and the time taken by each read/write/delete operation by the repository implementation. This is especially true on S3, where operations can take time (disconnections, retries, timeouts) and where the data consistency model allows reading old data or requires some time for objects to be replicated.
Here are some scenarios seen in cluster logs:

a) The snapshot is finalized by the snapshot deletion. The snapshot metadata file exists in the repository, so the future finalization by the snapshot creation will fail with a "fail to finalize snapshot" message in the logs. The deletion process continues.

b) The snapshot is finalized by the snapshot creation. The snapshot metadata file exists in the repository, so the future finalization by the snapshot deletion will fail with a "fail to finalize snapshot" message in the logs. The deletion process continues.

c) Both finalizations are executed concurrently, and things can fail at different read or write operations. Shard failures can be lost, as well as the final snapshot state, depending on which `SnapshotInProgress.Entry` is used to finalize the snapshot.

d) The snapshot is finalized by the snapshot deletion, and the snapshot in progress is removed from the cluster state, triggering the execution of the completion listeners.
The deletion process continues and `deleteSnapshotFromRepository()` is executed using the `SNAPSHOT` thread pool. This method reads the repository data, the snapshot metadata and the index metadata for all indices included in the snapshot before updating the index-N file in the repository. It can also take some time, and I think these operations could potentially be executed concurrently with the finalization of the snapshot by the snapshot creation, leading to corrupted data.
This commit does not solve all the issues reported here, but it removes the finalization of the snapshot by the snapshot deletion. This way, the deletion marks the snapshot as `ABORTED` in the cluster state and waits for the snapshot completion. It is the responsibility of the snapshot execution to detect the abortion and terminate itself correctly. This avoids concurrent snapshot finalizations and also orders the operations: the deletion aborts the snapshot and waits for the snapshot completion, the creation detects the abortion, stops by itself and finalizes the snapshot, then the deletion resumes and continues the deletion process.
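A compact sketch of the ordering this change enforces, with simplified names standing in for the real cluster-state and listener machinery (not the actual SnapshotsService API):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;

// Sketch of the enforced ordering: the deletion only marks the snapshot ABORTED
// and then waits on a completion listener; the creation path is the single place
// that finalizes the snapshot, after which the deletion resumes.
public class OrderedAbortSketch {
    enum State { INIT, STARTED, ABORTED, DONE }

    static final AtomicReference<State> state = new AtomicReference<>(State.INIT);
    static final CompletableFuture<Void> snapshotCompleted = new CompletableFuture<>();

    static void deleteSnapshot() {
        state.set(State.ABORTED);                       // 1. deletion aborts the snapshot
        snapshotCompleted.thenRun(() ->                 // 3. then resumes the deletion
                System.out.println("deleting snapshot files from the repository"));
    }

    static void runSnapshot() {
        if (state.get() == State.ABORTED) {             // 2. creation detects the abort,
            System.out.println("creation finalizing aborted snapshot");
            state.set(State.DONE);                      //    finalizes the snapshot,
            snapshotCompleted.complete(null);           //    and notifies the listener
        }
    }

    public static void main(String[] args) {
        deleteSnapshot();
        runSnapshot();
    }
}
```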
Closes #27974